{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Ejercicio Feature Engineering\n",
"\n",
"### Reemplace todos los valores nulos o vacíos del archivo csv pertinente\n",
"\n",
"Este ejercicio consiste en cargar un archivo csv que tiene valores nulos en más de una columna. Es requerido reemplazar los valores nulos por el mejor valor posible."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. [Importar las librerias y cargar el archivo csv](#1)\n",
"2. [Visualizar que datos son los que faltan](#2)\n",
"3. [Ver los filas de los datos faltantes y la matriz de correlación](#3)\n",
"4. [Reemplazar los valores nulos de la columna mpg](#4)\n",
"5. [(Opcional) Reemplazar los valores nulos de mpg utilizando funciones programadas por mí](#5)\n",
"6. [Reemplazar los valores nulos de la columna caballos_potencia](#6)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Importar las librerias y cargar el archivo csv\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import math\n",
"\n",
"autos = pd.read_csv(os.path.join('./csv/', 'datos_automoviles.csv'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Visualizar que datos son los que faltan\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" nombre | \n",
" cilindros | \n",
" peso | \n",
" anio | \n",
" territorio | \n",
" aceleracion | \n",
" mpg | \n",
" caballos_potencia | \n",
" desplazamiento | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 406 | \n",
" 406.000000 | \n",
" 406.000000 | \n",
" 406.000000 | \n",
" 406 | \n",
" 406.000000 | \n",
" 398.000000 | \n",
" 400.000000 | \n",
" 406.000000 | \n",
"
\n",
" \n",
" unique | \n",
" 312 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 3 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" top | \n",
" Ford Pinto | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" USA | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" freq | \n",
" 6 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 254 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" mean | \n",
" NaN | \n",
" 5.475369 | \n",
" 2979.413793 | \n",
" 1975.921182 | \n",
" NaN | \n",
" 15.519704 | \n",
" 23.514573 | \n",
" 105.082500 | \n",
" 194.779557 | \n",
"
\n",
" \n",
" std | \n",
" NaN | \n",
" 1.712160 | \n",
" 847.004328 | \n",
" 3.748737 | \n",
" NaN | \n",
" 2.803359 | \n",
" 7.815984 | \n",
" 38.768779 | \n",
" 104.922458 | \n",
"
\n",
" \n",
" min | \n",
" NaN | \n",
" 3.000000 | \n",
" 1613.000000 | \n",
" 1970.000000 | \n",
" NaN | \n",
" 8.000000 | \n",
" 9.000000 | \n",
" 46.000000 | \n",
" 68.000000 | \n",
"
\n",
" \n",
" 25% | \n",
" NaN | \n",
" 4.000000 | \n",
" 2226.500000 | \n",
" 1973.000000 | \n",
" NaN | \n",
" 13.700000 | \n",
" 17.500000 | \n",
" 75.750000 | \n",
" 105.000000 | \n",
"
\n",
" \n",
" 50% | \n",
" NaN | \n",
" 4.000000 | \n",
" 2822.500000 | \n",
" 1976.000000 | \n",
" NaN | \n",
" 15.500000 | \n",
" 23.000000 | \n",
" 95.000000 | \n",
" 151.000000 | \n",
"
\n",
" \n",
" 75% | \n",
" NaN | \n",
" 8.000000 | \n",
" 3618.250000 | \n",
" 1979.000000 | \n",
" NaN | \n",
" 17.175000 | \n",
" 29.000000 | \n",
" 130.000000 | \n",
" 302.000000 | \n",
"
\n",
" \n",
" max | \n",
" NaN | \n",
" 8.000000 | \n",
" 5140.000000 | \n",
" 1982.000000 | \n",
" NaN | \n",
" 24.800000 | \n",
" 46.600000 | \n",
" 230.000000 | \n",
" 455.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" nombre cilindros peso anio territorio \\\n",
"count 406 406.000000 406.000000 406.000000 406 \n",
"unique 312 NaN NaN NaN 3 \n",
"top Ford Pinto NaN NaN NaN USA \n",
"freq 6 NaN NaN NaN 254 \n",
"mean NaN 5.475369 2979.413793 1975.921182 NaN \n",
"std NaN 1.712160 847.004328 3.748737 NaN \n",
"min NaN 3.000000 1613.000000 1970.000000 NaN \n",
"25% NaN 4.000000 2226.500000 1973.000000 NaN \n",
"50% NaN 4.000000 2822.500000 1976.000000 NaN \n",
"75% NaN 8.000000 3618.250000 1979.000000 NaN \n",
"max NaN 8.000000 5140.000000 1982.000000 NaN \n",
"\n",
" aceleracion mpg caballos_potencia desplazamiento \n",
"count 406.000000 398.000000 400.000000 406.000000 \n",
"unique NaN NaN NaN NaN \n",
"top NaN NaN NaN NaN \n",
"freq NaN NaN NaN NaN \n",
"mean 15.519704 23.514573 105.082500 194.779557 \n",
"std 2.803359 7.815984 38.768779 104.922458 \n",
"min 8.000000 9.000000 46.000000 68.000000 \n",
"25% 13.700000 17.500000 75.750000 105.000000 \n",
"50% 15.500000 23.000000 95.000000 151.000000 \n",
"75% 17.175000 29.000000 130.000000 302.000000 \n",
"max 24.800000 46.600000 230.000000 455.000000 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"autos.describe(include='all')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Ver los filas de los datos faltantes y la matriz de correlación\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" nombre | \n",
" cilindros | \n",
" peso | \n",
" anio | \n",
" territorio | \n",
" aceleracion | \n",
" mpg | \n",
" caballos_potencia | \n",
" desplazamiento | \n",
"
\n",
" \n",
" \n",
" \n",
" 10 | \n",
" Citroen Ds-21 Pallas | \n",
" 4 | \n",
" 3090 | \n",
" 1970 | \n",
" Europe | \n",
" 17.5 | \n",
" NaN | \n",
" 115.0 | \n",
" 133.0 | \n",
"
\n",
" \n",
" 11 | \n",
" Chevrolet Chevelle Concours (Sw) | \n",
" 8 | \n",
" 4142 | \n",
" 1970 | \n",
" USA | \n",
" 11.5 | \n",
" NaN | \n",
" 165.0 | \n",
" 350.0 | \n",
"
\n",
" \n",
" 12 | \n",
" Ford Torino (Sw) | \n",
" 8 | \n",
" 4034 | \n",
" 1970 | \n",
" USA | \n",
" 11.0 | \n",
" NaN | \n",
" 153.0 | \n",
" 351.0 | \n",
"
\n",
" \n",
" 13 | \n",
" Plymouth Satellite (Sw) | \n",
" 8 | \n",
" 4166 | \n",
" 1970 | \n",
" USA | \n",
" 10.5 | \n",
" NaN | \n",
" 175.0 | \n",
" 383.0 | \n",
"
\n",
" \n",
" 14 | \n",
" Amc Rebel Sst (Sw) | \n",
" 8 | \n",
" 3850 | \n",
" 1970 | \n",
" USA | \n",
" 11.0 | \n",
" NaN | \n",
" 175.0 | \n",
" 360.0 | \n",
"
\n",
" \n",
" 17 | \n",
" Ford Mustang Boss 302 | \n",
" 8 | \n",
" 3353 | \n",
" 1970 | \n",
" USA | \n",
" 8.0 | \n",
" NaN | \n",
" 140.0 | \n",
" 302.0 | \n",
"
\n",
" \n",
" 39 | \n",
" Volkswagen Super Beetle 117 | \n",
" 4 | \n",
" 1978 | \n",
" 1971 | \n",
" Europe | \n",
" 20.0 | \n",
" NaN | \n",
" 48.0 | \n",
" 97.0 | \n",
"
\n",
" \n",
" 367 | \n",
" Saab 900S | \n",
" 4 | \n",
" 2800 | \n",
" 1981 | \n",
" Europe | \n",
" 15.4 | \n",
" NaN | \n",
" 110.0 | \n",
" 121.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" nombre cilindros peso anio territorio \\\n",
"10 Citroen Ds-21 Pallas 4 3090 1970 Europe \n",
"11 Chevrolet Chevelle Concours (Sw) 8 4142 1970 USA \n",
"12 Ford Torino (Sw) 8 4034 1970 USA \n",
"13 Plymouth Satellite (Sw) 8 4166 1970 USA \n",
"14 Amc Rebel Sst (Sw) 8 3850 1970 USA \n",
"17 Ford Mustang Boss 302 8 3353 1970 USA \n",
"39 Volkswagen Super Beetle 117 4 1978 1971 Europe \n",
"367 Saab 900S 4 2800 1981 Europe \n",
"\n",
" aceleracion mpg caballos_potencia desplazamiento \n",
"10 17.5 NaN 115.0 133.0 \n",
"11 11.5 NaN 165.0 350.0 \n",
"12 11.0 NaN 153.0 351.0 \n",
"13 10.5 NaN 175.0 383.0 \n",
"14 11.0 NaN 175.0 360.0 \n",
"17 8.0 NaN 140.0 302.0 \n",
"39 20.0 NaN 48.0 97.0 \n",
"367 15.4 NaN 110.0 121.0 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"autos.loc[autos['mpg'].isnull()]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" nombre | \n",
" cilindros | \n",
" peso | \n",
" anio | \n",
" territorio | \n",
" aceleracion | \n",
" mpg | \n",
" caballos_potencia | \n",
" desplazamiento | \n",
"
\n",
" \n",
" \n",
" \n",
" 38 | \n",
" Ford Pinto | \n",
" 4 | \n",
" 2046 | \n",
" 1971 | \n",
" USA | \n",
" 19.0 | \n",
" 25.0 | \n",
" NaN | \n",
" 98.0 | \n",
"
\n",
" \n",
" 133 | \n",
" Ford Maverick | \n",
" 6 | \n",
" 2875 | \n",
" 1974 | \n",
" USA | \n",
" 17.0 | \n",
" 21.0 | \n",
" NaN | \n",
" 200.0 | \n",
"
\n",
" \n",
" 337 | \n",
" Renault Lecar Deluxe | \n",
" 4 | \n",
" 1835 | \n",
" 1980 | \n",
" Europe | \n",
" 17.3 | \n",
" 40.9 | \n",
" NaN | \n",
" 85.0 | \n",
"
\n",
" \n",
" 343 | \n",
" Ford Mustang Cobra | \n",
" 4 | \n",
" 2905 | \n",
" 1980 | \n",
" USA | \n",
" 14.3 | \n",
" 23.6 | \n",
" NaN | \n",
" 140.0 | \n",
"
\n",
" \n",
" 361 | \n",
" Renault 18I | \n",
" 4 | \n",
" 2320 | \n",
" 1981 | \n",
" Europe | \n",
" 15.8 | \n",
" 34.5 | \n",
" NaN | \n",
" 100.0 | \n",
"
\n",
" \n",
" 382 | \n",
" Amc Concord Dl | \n",
" 4 | \n",
" 3035 | \n",
" 1982 | \n",
" USA | \n",
" 20.5 | \n",
" 23.0 | \n",
" NaN | \n",
" 151.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" nombre cilindros peso anio territorio aceleracion \\\n",
"38 Ford Pinto 4 2046 1971 USA 19.0 \n",
"133 Ford Maverick 6 2875 1974 USA 17.0 \n",
"337 Renault Lecar Deluxe 4 1835 1980 Europe 17.3 \n",
"343 Ford Mustang Cobra 4 2905 1980 USA 14.3 \n",
"361 Renault 18I 4 2320 1981 Europe 15.8 \n",
"382 Amc Concord Dl 4 3035 1982 USA 20.5 \n",
"\n",
" mpg caballos_potencia desplazamiento \n",
"38 25.0 NaN 98.0 \n",
"133 21.0 NaN 200.0 \n",
"337 40.9 NaN 85.0 \n",
"343 23.6 NaN 140.0 \n",
"361 34.5 NaN 100.0 \n",
"382 23.0 NaN 151.0 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"autos.loc[autos['caballos_potencia'].isnull()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Se quiere saber que variable(columna) tiene la mejor correlación con la variable que tiene valores nulos\n",
"en este caso, la variable que tiene la mejor correlación con mpg es peso.\n",
"\n",
"Pero, como peso puede adqurir múltiples valores es mejor utilizar a la variable cilindros,\n",
"porque cilindros sólo puede adqurir pocos valores, por tanto, el valor de la mediana será más preciso."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" cilindros | \n",
" peso | \n",
" anio | \n",
" aceleracion | \n",
" mpg | \n",
" caballos_potencia | \n",
" desplazamiento | \n",
"
\n",
" \n",
" \n",
" \n",
" cilindros | \n",
" 1.000000 | \n",
" 0.895220 | \n",
" -0.360762 | \n",
" -0.522452 | \n",
" -0.775396 | \n",
" 0.844158 | \n",
" 0.951787 | \n",
"
\n",
" \n",
" peso | \n",
" 0.895220 | \n",
" 1.000000 | \n",
" -0.315389 | \n",
" -0.430086 | \n",
" -0.831741 | \n",
" 0.866586 | \n",
" 0.932475 | \n",
"
\n",
" \n",
" anio | \n",
" -0.360762 | \n",
" -0.315389 | \n",
" 1.000000 | \n",
" 0.301992 | \n",
" 0.579267 | \n",
" -0.424419 | \n",
" -0.381714 | \n",
"
\n",
" \n",
" aceleracion | \n",
" -0.522452 | \n",
" -0.430086 | \n",
" 0.301992 | \n",
" 1.000000 | \n",
" 0.420289 | \n",
" -0.697124 | \n",
" -0.557984 | \n",
"
\n",
" \n",
" mpg | \n",
" -0.775396 | \n",
" -0.831741 | \n",
" 0.579267 | \n",
" 0.420289 | \n",
" 1.000000 | \n",
" -0.778427 | \n",
" -0.804203 | \n",
"
\n",
" \n",
" caballos_potencia | \n",
" 0.844158 | \n",
" 0.866586 | \n",
" -0.424419 | \n",
" -0.697124 | \n",
" -0.778427 | \n",
" 1.000000 | \n",
" 0.898326 | \n",
"
\n",
" \n",
" desplazamiento | \n",
" 0.951787 | \n",
" 0.932475 | \n",
" -0.381714 | \n",
" -0.557984 | \n",
" -0.804203 | \n",
" 0.898326 | \n",
" 1.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" cilindros peso anio aceleracion mpg \\\n",
"cilindros 1.000000 0.895220 -0.360762 -0.522452 -0.775396 \n",
"peso 0.895220 1.000000 -0.315389 -0.430086 -0.831741 \n",
"anio -0.360762 -0.315389 1.000000 0.301992 0.579267 \n",
"aceleracion -0.522452 -0.430086 0.301992 1.000000 0.420289 \n",
"mpg -0.775396 -0.831741 0.579267 0.420289 1.000000 \n",
"caballos_potencia 0.844158 0.866586 -0.424419 -0.697124 -0.778427 \n",
"desplazamiento 0.951787 0.932475 -0.381714 -0.557984 -0.804203 \n",
"\n",
" caballos_potencia desplazamiento \n",
"cilindros 0.844158 0.951787 \n",
"peso 0.866586 0.932475 \n",
"anio -0.424419 -0.381714 \n",
"aceleracion -0.697124 -0.557984 \n",
"mpg -0.778427 -0.804203 \n",
"caballos_potencia 1.000000 0.898326 \n",
"desplazamiento 0.898326 1.000000 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"autos.corr()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Reemplazar los valores nulos de la columna mpg\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"cilindros_agrupacion = autos.groupby('cilindros')\n",
"\n",
"# Esto crea una serie del mismo tamaño del DataFrame que contiene el valor de la mediana de las mpg con respecto\n",
"# a los cilindros.\n",
"mediana_mpg_por_cilindro = cilindros_agrupacion['mpg'].transform('median')\n",
"\n",
"# Reemplazar los valores nulos de mpg con la mediana obteniad anteriormente\n",
"autos['mpg'].fillna(mediana_mpg_por_cilindro, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. (Opcional) Reemplazar los valores nulos de mpg utilizando funciones programadas por mí\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"cilindros_agrupacion = autos.groupby('cilindros')\n",
"# Crear una serie que tenga como índices los cilindros y como valor la mediana de las mpg seǵun los cilindros\n",
"mediana_mpg_segun_cilindros = cilindros_agrupacion['mpg'].median()\n",
"\n",
"# Funcion que recibe una fila y si tiene la columna mpg como nan le pone el valor de la mediana de las mpg\n",
"# según el cilindro\n",
"def poner_mediana_en_mpg_nulos(fila):\n",
" if math.isnan(fila['mpg']):\n",
" fila['mpg'] = mediana_mpg_segun_cilindros[fila['cilindros']]\n",
" return fila\n",
"\n",
"# Aplicar a cada fila del data frame la funcion definida arriba\n",
"autos_con_mpg = autos.apply(poner_mediana_en_mpg_nulos, axis=1)\n",
"\n",
"# La misma funcion pero utilizando cálculo lambda\n",
"#autos_con_mpg = autos.apply(lambda x: mediana_mpg_segun_cilindros[x['cilindros']] if math.isnan(x['mpg']) else x, axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Reemplazar los valores nulos de la columna caballos_potencia\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# El procedimiento es el mismo que el de reemplazar los valores nulos de mpg\n",
"mediana_caballos_potencia_segun_cilindros = cilindros_agrupacion['caballos_potencia'].transform('median')\n",
"autos['caballos_potencia'].fillna(mediana_caballos_potencia_segun_cilindros, inplace=True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}